It has been shown that one can accommodate data (Bayes) and constraints (MaxEnt) 
in one method, the method of Maximum (relative) Entropy (ME) (Giffin 2007). 
In this paper we show a complex agent-based example of inference with two different 
forms of information: moments and data. In this example, several agents each receive 
partial information about a system in the form of data. In addition, each agent agrees 
or is informed that there are certain global constraints on the system that are always 
true. The agents are then asked to make inferences about the entire system. The 
system becomes more complex as we add agents and allow them to share information. 
This system can have a geometrical form, such as a crystal structure. The shape may 
dictate how the agents are able to share information, such as sharing with nearest 
neighbors. This method can be used to model many systems where the agents or cells 
have local or partial information but must adhere to some global rules. 



1 Introduction 

There are many examples of systems where agents respond to both local 
information and global information. Nature yields many such examples where 
cells react to local stimuli yet carry some global instructions, such as 
reproduction. The examples get more complex when the cells interact locally or share 
information. This is the case in physics when one has a lattice or group of many 
atoms where each is only affected by its nearest neighbor. In all of these cases we 
would like to infer something about the system or, better, what each agent infers 
about the system. It is this latter case that we will be specifically addressing. 
The main purpose of this paper is to examine a situation where each agent in 
a network (of varying degrees of complexity) infers something about the whole 
system based on limited information. By doing this we hope to attain clues 
about the system's emergent properties, such as its dynamics, evolution, etc. 

The two preeminent inference methods are the MaxEnt [1] method, which 
has evolved into a more general method, the method of Maximum (relative) 
Entropy (ME) [2, 3, 4], and Bayes' rule. The choice between the two methods has 
traditionally been dictated by the nature of the information being processed 
(either constraints or observed data). However, it has been shown that one can 
accommodate both types of information in one method, ME [5]. In fact, this 
new ME method can reproduce every aspect of Bayesian and MaxEnt inference 
and tackle problems that the two methods alone could not address. In this paper 
we will show how the ME method can be used to infer properties of the system 
under investigation. 

We start by showing a general example of the ME method by inferring a 
probability with two different forms of information, expected values and data, 
simultaneously. The solution resembles Bayes' rule. In fact, if there are no 
moment constraints then the method produces Bayes' rule exactly. If there are no 
data, then the MaxEnt solution is produced. 

Finally we solve a toy problem where we include global information in the 
form of a moment constraint or expected value and then introduce local 
information in the form of data. This will show how the agents infer aspects of the 
whole system using the same process yet come to different conclusions. 
Complexity increases as the number of agents increases, yet the complexity of 
the process does not grow proportionately. This illustrates the advantages of 
using the ME method. 

2 Simultaneous updating 

Our first concern when using the ME method to update from a prior to a 
posterior distribution$^2$ is to define the space in which the search for the posterior will 
be conducted. We wish to infer something about the values of one or several 
quantities, $\theta \in \Theta$, on the basis of three pieces of information: prior information 
about $\theta$ (the prior), the known relationship between $x$ and $\theta$ (the model), and 
the observed values of the data $x \in X$. Since we are concerned with both $x$ 
and $\theta$, the relevant space is neither $X$ nor $\Theta$ but the product $X \times \Theta$, and our 
attention must be focused on the joint distribution $P(x,\theta)$. The selected joint 

1 For simplicity we will refer to these expected values as moments although they can be 
considerably more general. 

2 In Bayesian inference, it is assumed that one always has a prior probability based on some 
prior information. When new information is obtained, the old probability (the prior) is updated 
to a new probability (the posterior). If one has no prior information, then one uses an ignorant 
prior [6]. 



posterior $P_{\mathrm{new}}(x,\theta)$ is that which maximizes the entropy, 

$$S[P, P_{\mathrm{old}}] = -\int dx\,d\theta\, P(x,\theta) \log\frac{P(x,\theta)}{P_{\mathrm{old}}(x,\theta)} , \qquad (1)$$

subject to the appropriate constraints. $P_{\mathrm{old}}(x,\theta)$ contains our prior information, 
which we call the joint prior. To be explicit, 

$$P_{\mathrm{old}}(x,\theta) = P_{\mathrm{old}}(\theta)\,P_{\mathrm{old}}(x|\theta) , \qquad (2)$$

where $P_{\mathrm{old}}(\theta)$ is the traditional Bayesian prior and $P_{\mathrm{old}}(x|\theta)$ is the likelihood. 
It is important to note that they both contain prior information. The Bayesian 
prior is defined as containing prior information. However, the likelihood is not 
traditionally thought of in terms of prior information. Of course it is reasonable 
to see it as such because the likelihood represents the model (the relationship 
between $\theta$ and $x$) that has already been established. Thus we consider both 
pieces, the Bayesian prior and the likelihood, to be prior information. 

The new information is the observed data, $x'$, which in the ME framework 
must be expressed in the form of a constraint on the allowed posteriors. The 
family of posteriors that reflects the fact that $x$ is now known to be $x'$ is such 
that 

$$C_1 : P(x) = \int d\theta\, P(x,\theta) = \delta(x - x') . \qquad (3)$$

This amounts to an infinite number of constraints: there is one constraint on 
$P(x,\theta)$ for each value of the variable $x$, and each constraint will require its own 
Lagrange multiplier $\lambda(x)$. Furthermore, we impose the usual normalization 
constraint, 

$$\int dx\,d\theta\, P(x,\theta) = 1 , \qquad (4)$$

and include additional information about $\theta$ in the form of a constraint on the 
expected value of some function $f(\theta)$, 

$$C_2 : \int dx\,d\theta\, P(x,\theta)\, f(\theta) = \langle f(\theta) \rangle = F . \qquad (5)$$

We emphasize that constraints imposed at the level of the prior need not be 
satisfied by the posterior. What we do here differs from the standard Bayesian 
practice in that we require the constraint to be satisfied by the posterior distri- 
bution. 

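Maximizing (1) subject to the above constraints, 

$$\delta\Big\{ S + \alpha\Big[\int dx\,d\theta\, P(x,\theta) - 1\Big] + \beta\Big[\int dx\,d\theta\, P(x,\theta) f(\theta) - F\Big] + \int dx\, \lambda(x)\Big[\int d\theta\, P(x,\theta) - \delta(x - x')\Big] \Big\} = 0 , \qquad (6)$$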



3 Including an additional constraint in the form of $\int dx\,d\theta\, P(x,\theta)\, g(x) = \langle g \rangle = G$ could only 
be used when it does not contradict the data constraint (3). Therefore, it is redundant and 
the constraint would simply get absorbed when solving for $\lambda(x)$. 



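yields the joint posterior, 

$$P_{\mathrm{new}}(x,\theta) = P_{\mathrm{old}}(x,\theta)\, \frac{e^{\lambda(x) + \beta f(\theta)}}{Z} , \qquad (7)$$

where $Z$ is determined by using (4), 

$$Z = e^{-\alpha + 1} = \int dx\,d\theta\, e^{\lambda(x) + \beta f(\theta)}\, P_{\mathrm{old}}(x,\theta) , \qquad (8)$$

and the Lagrange multipliers $\lambda(x)$ are determined by using (3), 

$$e^{\lambda(x)} = \frac{Z}{\int d\theta\, e^{\beta f(\theta)}\, P_{\mathrm{old}}(x,\theta)}\, \delta(x - x') . \qquad (9)$$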

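The posterior now becomes 

$$P_{\mathrm{new}}(x,\theta) = P_{\mathrm{old}}(x,\theta)\, \delta(x - x')\, \frac{e^{\beta f(\theta)}}{\zeta(x,\beta)} , \qquad (10)$$

where $\zeta(x,\beta) = \int d\theta\, e^{\beta f(\theta)}\, P_{\mathrm{old}}(x,\theta)$. 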

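The Lagrange multiplier $\beta$ is determined by first substituting the posterior 
into (5), 

$$\int dx\,d\theta \left[ P_{\mathrm{old}}(x,\theta)\, \delta(x - x')\, \frac{e^{\beta f(\theta)}}{\zeta(x,\beta)} \right] f(\theta) = F . \qquad (11)$$

Integrating over $x$ yields 

$$\frac{\int d\theta\, e^{\beta f(\theta)}\, P_{\mathrm{old}}(x',\theta)\, f(\theta)}{\zeta(x',\beta)} = F , \qquad (12)$$

where $\zeta(x,\beta) \to \zeta(x',\beta) = \int d\theta\, e^{\beta f(\theta)}\, P_{\mathrm{old}}(x',\theta)$. Now $\beta$ can be determined by 

$$\frac{\partial \ln \zeta(x',\beta)}{\partial \beta} = F . \qquad (13)$$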
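The step from (12) to (13) may be worth spelling out. A short sketch, assuming that differentiation under the integral sign is permitted for the $f(\theta)$ and $P_{\mathrm{old}}$ in question:

```latex
\frac{\partial \ln \zeta(x',\beta)}{\partial \beta}
  = \frac{1}{\zeta(x',\beta)} \frac{\partial}{\partial \beta}
    \int d\theta\, e^{\beta f(\theta)}\, P_{\mathrm{old}}(x',\theta)
  = \frac{\int d\theta\, f(\theta)\, e^{\beta f(\theta)}\, P_{\mathrm{old}}(x',\theta)}
         {\zeta(x',\beta)}
  = F .
```

The middle expression is the left side of (12), so the two conditions on $\beta$ coincide.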



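The final step is to marginalize the posterior, $P_{\mathrm{new}}(x,\theta)$, over $x$ to get our 
updated probability, 

$$P_{\mathrm{new}}(\theta) = P_{\mathrm{old}}(x',\theta)\, \frac{e^{\beta f(\theta)}}{\zeta(x',\beta)} . \qquad (14)$$

Additionally, this result can be rewritten using the product rule as 

$$P_{\mathrm{new}}(\theta) = P_{\mathrm{old}}(\theta)\, P_{\mathrm{old}}(x'|\theta)\, \frac{e^{\beta f(\theta)}}{\zeta(x',\beta)} , \qquad (15)$$

where $\zeta(x',\beta) = \int d\theta\, e^{\beta f(\theta)}\, P_{\mathrm{old}}(\theta)\, P_{\mathrm{old}}(x'|\theta)$. The right side resembles Bayes' 
theorem, where the term $P_{\mathrm{old}}(x'|\theta)$ is the standard Bayesian likelihood and $P_{\mathrm{old}}(\theta)$ 
is the prior. The exponential term is a modification to these two terms. Notice 
that when $\beta = 0$ (no moment constraint) we recover Bayes' rule. For $\beta \neq 0$ Bayes' 
rule is modified by a "canonical" exponential factor. 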
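The update in (15) can be illustrated numerically. The following is a minimal sketch rather than the authors' implementation: it assumes a one-dimensional $\theta$ discretized on a grid, and the function name `me_posterior`, the bisection bounds and the example numbers are all hypothetical choices made for illustration. The multiplier $\beta$ is found by bisection, using the fact that the tilted expectation of $f$ increases with $\beta$.

```python
import math

def me_posterior(theta, prior, likelihood, f, F, lo=-50.0, hi=50.0, iters=200):
    """Grid sketch of (15): posterior proportional to prior * likelihood * exp(beta * f).

    Returns (beta, posterior). The bounds lo/hi are an assumption and must
    bracket the value of beta that satisfies the moment constraint <f> = F.
    """
    base = [p * l for p, l in zip(prior, likelihood)]
    fvals = [f(t) for t in theta]

    def tilted(beta):
        # Normalized weights; the normalizer plays the role of zeta(x', beta).
        w = [b * math.exp(beta * fv) for b, fv in zip(base, fvals)]
        z = sum(w)
        return [wi / z for wi in w]

    def moment(beta):
        return sum(p * fv for p, fv in zip(tilted(beta), fvals))

    # d<f>/d(beta) is the variance of f, so <f> is monotone and bisection applies.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if moment(mid) < F:
            lo = mid
        else:
            hi = mid
    beta = 0.5 * (lo + hi)
    return beta, tilted(beta)

# Hypothetical usage: 3 successes in 10 trials, flat prior, constraint <theta> = 0.5.
theta = [(i + 0.5) / 200 for i in range(200)]
prior = [1.0] * len(theta)
like = [t ** 3 * (1 - t) ** 7 for t in theta]
beta, post = me_posterior(theta, prior, like, lambda t: t, 0.5)
print(beta, sum(p * t for p, t in zip(post, theta)))
```

If $F$ is set to the mean of the ordinary Bayesian posterior, the constraint adds nothing and the sketch returns $\beta \approx 0$, which is the Bayes' rule limit described above.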

It must be noted that MaxEnt has been traditionally used for obtaining a 
prior for use in Bayesian statistics. When this is the case, the updating is 
sequential. This is not the case here, where both types of information are processed 
simultaneously. In the sequential updating case, the multiplier $\beta$ is chosen so 
that the posterior $P_{\mathrm{new}}$ only satisfies $C_2$. In the simultaneous updating case the 
multiplier $\beta$ is chosen so that the posterior $P_{\mathrm{new}}$ satisfies both $C_1$ and $C_2$ or 

3 The agent example 

Let us start with a very simple example: There is a class with 3 students sitting 
in desks next to each other and one professor. The professor announces that he 
has a loaded, 3 sided die and he would like his students to try to discern the 
probability of getting a 1, a 2 or a 3. He tells them that he has created this 
die in such a way that on the average, side 1 is twice as likely to come up as 
side 3. Now he rolls the die without showing them the results. He announces 
that he has rolled the die 10 times. Then he writes down how many times a 1 
came up on a piece of paper and hands it to student A, careful not to let the 
other students see it. He proceeds to do this for each of the other students, 
giving student B the results of side 2 and student C the results of side 3. What 
would each student determine the probabilities of the sides to be? Each needs 
to determine the probability of getting any particular outcome in one draw ($\theta_i$) 
given the information. 

We summarize the information in the following way: there are 3 agents, A, B 
and C. The die is rolled and the counts of each side are represented by $m_1$, $m_2$ 
and $m_3$ respectively, with $n$ representing the total count so that $n = \sum_{i=1}^{3} m_i$. 
Additionally, we know that on the average one side, $s_1$, is twice as likely to be 
rolled as $s_3$. 

The first task is to realize that the correct mathematical model for the 
probability of getting a particular side, where the information that we have is the 
number of sides counted, is a multinomial distribution. The probability of 
finding $k$ sides in $n$ counts which yields $m_i$ instances for the $i$th side is 

$$P_{\mathrm{old}}(m|\theta,n) = P_{\mathrm{old}}(m_1 \ldots m_k|\theta_1 \ldots \theta_k, n) = \frac{n!}{m_1! \ldots m_k!}\, \theta_1^{m_1} \ldots \theta_k^{m_k} , \qquad (16)$$

where $m = (m_1, \ldots, m_k)$ with $\sum_{i=1}^{k} m_i = n$, and $\theta = (\theta_1, \ldots, \theta_k)$ with $\sum_{i=1}^{k} \theta_i = 1$. 
The general problem is to infer the parameters $\theta$ on the basis of information 
about the data, $m'$. 
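For concreteness, the multinomial model (16) can be evaluated directly. This is a small sketch using only the Python standard library; the function name `multinomial_pmf` is a hypothetical helper introduced here, not something from the text.

```python
from math import factorial, prod

def multinomial_pmf(m, theta):
    """Probability (16) of the counts m = (m_1, ..., m_k) given theta = (theta_1, ..., theta_k)."""
    n = sum(m)
    coeff = factorial(n)
    for mi in m:
        # Each partial quotient n! / (m_1! ... m_j!) is an integer, so // is exact.
        coeff //= factorial(mi)
    return coeff * prod(t ** mi for t, mi in zip(theta, m))

# Hypothetical usage: a fair 3-sided die rolled 3 times, each side seen once.
print(multinomial_pmf((1, 1, 1), (1 / 3, 1 / 3, 1 / 3)))
```

Summing this function over every count vector with $\sum_i m_i = n$ returns 1, which is a quick check that the model is normalized over the data.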

Additionally we can include information about the bias of the sides by using 
the following general constraint, 

$$\langle f(\theta) \rangle = F \quad \text{where} \quad f(\theta) = \sum_{i}^{k} f_i \theta_i , \qquad (17)$$

where $f_i$ is used to represent the die bias. For our example, on the average, 
we will find twice the number of $s_1$ as compared to $s_3$; thus, on the average, the 
probability of finding one of the sides will be twice that of the other, $\langle \theta_1 \rangle = 2 \langle \theta_3 \rangle$. 
In this case, $f_1 = 1$, $f_3 = -2$ and $f_2 = F = 0$. 

Next we need to write the data (counts) as a constraint which in general is 

$$P(m|n) = \delta(m - m') , \qquad (18)$$

where $m' = \{m'_1, \ldots, m'_k\}$. Finally we write the appropriate entropy to use, 

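$$S[P, P_{\mathrm{old}}] = -\sum_{m} \int d\theta\, P(m,\theta|n) \log\frac{P(m,\theta|n)}{P_{\mathrm{old}}(m,\theta|n)} , \qquad (19)$$

where 

$$\sum_{m} = \sum_{m_1 \ldots m_k = 0}^{n} \delta\Big(\sum_{i=1}^{k} m_i - n\Big) , \qquad (20)$$

and 

$$\int d\theta = \int d\theta_1 \ldots d\theta_k\, \delta\Big(\sum_{i=1}^{k} \theta_i - 1\Big) , \qquad (21)$$

and where $P_{\mathrm{old}}(m,\theta|n) = P_{\mathrm{old}}(\theta|n)\, P_{\mathrm{old}}(m|\theta,n)$. The prior $P_{\mathrm{old}}(\theta)$ is not 
important for our current purpose, so for the sake of definiteness we can choose it 
flat for our example (there are most likely better choices for priors). We then 
maximize this entropy with respect to $P(m,\theta|n)$ subject to normalization and 
our constraints, which after marginalizing over $m$ yields 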

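$$P(\theta) = P_{\mathrm{old}}(m'|\theta,n)\, \frac{e^{\beta f(\theta)}}{\zeta} , \qquad (22)$$

where 

$$\zeta = \int d\theta\, e^{\beta f(\theta)}\, P_{\mathrm{old}}(m'|\theta,n) \quad \text{and} \quad F = \frac{\partial \log \zeta}{\partial \beta} . \qquad (23)$$

Notice that if one has no information relating the sides then $\beta = 0$. 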
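For our 3-sided die the probability distribution is 

$$P_{e_1}(\theta_1,\theta_2) = \frac{1}{\zeta_{e_1}}\, e^{\beta(3\theta_1 + 2\theta_2 - 2)}\, \theta_1^{m'_1}\, \theta_2^{m'_2}\, (1 - \theta_1 - \theta_2)^{n - m'_1 - m'_2} . \qquad (24)$$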
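The exponent in (24) can be checked by eliminating $\theta_3$ from the bias function. A brief worked step, assuming only the normalization $\theta_1 + \theta_2 + \theta_3 = 1$ and the coefficients $f_1 = 1$, $f_2 = 0$, $f_3 = -2$ given above:

```latex
f(\theta) = \theta_1 - 2\theta_3
          = \theta_1 - 2\,(1 - \theta_1 - \theta_2)
          = 3\theta_1 + 2\theta_2 - 2 .
```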

However, each student only has the $m'$ that corresponds to their side. For 
example, student A has $m'_1$. Therefore student A must marginalize over the 
unknown information. The result is 

$$P_{e_2}(\theta_1,\theta_2) = \frac{1}{\zeta_{e_2}}\, e^{\beta(3\theta_1 + 2\theta_2 - 2)}\, \theta_1^{m'_1}\, (1 - \theta_1)^{n - m'_1} , \qquad (25)$$

where $\zeta_{e_2}$ is the normalization constant. This is the probability distribution 
that student A would assign to the die. Since all of the students will follow the 



Figure 1: An example structure that relates agents in a system. Here each vertex is 
an agent. 

same proper inference method (ME), we need only look at one of the student's 
solutions. Notice that all students or agents agree on some global information, 
the bias of the die and the number of total die rolls. However, in general they 
will determine a different probability distribution that is dependent on the local 
information, in this case the number of rolls of a particular side. 
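Student A's calculation can be sketched numerically. The code below is an illustration under stated assumptions, not the authors' procedure: it uses a flat prior on a coarse grid over the simplex, finds $\beta$ by bisection within hypothetical bounds, and the names `marginal_likelihood` and `student_a_posterior` are introduced here. The first function checks that summing the multinomial over the unknown count $m_2$ leaves a binomial in $\theta_1$; the second applies the bias constraint $\langle \theta_1 - 2\theta_3 \rangle = 0$.

```python
import math

def marginal_likelihood(m1, n, th1, th2):
    """Sum the 3-sided multinomial over the unknown m2 (with m3 = n - m1 - m2)."""
    th3 = 1.0 - th1 - th2
    total = 0.0
    for m2 in range(n - m1 + 1):
        m3 = n - m1 - m2
        coeff = math.factorial(n) // (
            math.factorial(m1) * math.factorial(m2) * math.factorial(m3)
        )
        total += coeff * th1 ** m1 * th2 ** m2 * th3 ** m3
    return total

def student_a_posterior(m1, n, steps=60):
    """Grid sketch: weights proportional to exp(beta*f) * th1^m1 * (1 - th1)^(n - m1).

    Returns (beta, (mean th1, mean th2, mean th3)) with beta chosen so that
    <th1 - 2*th3> = 0. A flat prior on the simplex is assumed.
    """
    pts = []
    for i in range(1, steps):
        for j in range(1, steps - i):
            th1, th2 = i / steps, j / steps
            pts.append((th1, th2, 1.0 - th1 - th2))
    like = [p[0] ** m1 * (1.0 - p[0]) ** (n - m1) for p in pts]
    f = [p[0] - 2.0 * p[2] for p in pts]  # same as 3*th1 + 2*th2 - 2

    def weights(beta):
        w = [l * math.exp(beta * fv) for l, fv in zip(like, f)]
        z = sum(w)
        return [wi / z for wi in w]

    def moment(beta):
        return sum(wi * fv for wi, fv in zip(weights(beta), f))

    lo, hi = -100.0, 100.0  # assumed bracket for beta
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if moment(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    beta = 0.5 * (lo + hi)
    w = weights(beta)
    means = tuple(sum(wi * p[k] for wi, p in zip(w, pts)) for k in range(3))
    return beta, means

# Hypothetical usage: student A is told that side 1 came up 3 times in 10 rolls.
print(student_a_posterior(3, 10))
```

Running the second function with a different $m'_1$ gives a different set of expected values, which is the sense in which agents who share the global constraint still reach different conclusions.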

Now imagine that each student's desk is at a vertex of an equilateral triangle 
(so that they are equidistant from each other). They notice that the teacher is 
looking the other way, so they each glance at their neighbors' papers. Since each 
of them now has all of the information, they should all come up with the same 
answers. 

Next let us create a more complex example by increasing the number of 
students. We enlarge the class by adding k students with a professor rolling a 
k sided die that is loaded in some given way. The students are arranged in a 
lattice structure such as in Figure 1, where there is one student at each of the 
vertices. Each student that is not on an edge now has six neighbors. Thus if 
they are allowed to 'look' at their nearest neighbors, the form of the probability 
distribution that each student would assign is 
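By analogy with (25), one plausible form can be sketched. The notation here is an assumption rather than taken from the text: $\mathcal{N}$ denotes the set of sides whose counts the student knows (their own and those of the six neighbors), and a flat prior is assumed as before. Marginalizing the multinomial over all of the unknown counts lumps the unknown sides together, which suggests

```latex
P(\theta) = \frac{1}{\zeta_{\mathcal{N}}}\, e^{\beta f(\theta)}
            \Big[ \prod_{i \in \mathcal{N}} \theta_i^{m'_i} \Big]
            \Big( 1 - \sum_{i \in \mathcal{N}} \theta_i \Big)^{\, n - \sum_{i \in \mathcal{N}} m'_i} ,
```

where $\zeta_{\mathcal{N}}$ is the corresponding normalization constant. For a single known side this reduces to the form of (25).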

4 Conclusions 

We demonstrated that the ME method can easily lend itself to agent based 
modeling. Whether the agents are skin cells, atoms in a lattice, banks in a 
network or students in a classroom, the methodology of ME can be applied in 
order to model many of these systems. Any system where agents agree on some 
global information yet react to local information should be able to be modeled 
with this method. It was further shown that the complexity of the computation 
can be kept to a minimum since we can marginalize over non-local data. 



By determining what each agent 'thinks' we can predict many properties of 
the system. An obvious extension of this work would be to apply decision theory 
concepts to the model so as to not only describe how the agents 'think' but what 
they 'choose' to do as well. This could illustrate how the agents evolve and could 
illuminate emergent behavior of the system. 

By using the ME method we can include additional information which al- 
lows us to go beyond what Bayes' rule and MaxEnt methods alone could do. 
Therefore, we would like to emphasize that anything one can do with Bayesian 
or MaxEnt methods, one can now do with ME. Additionally, in ME one now has 
the ability to apply additional information that Bayesian or MaxEnt methods 
could not process. Further, any work done with Bayesian techniques can be 
implemented into the ME method directly through the joint prior. 

A currently popular technique is to use entropic concepts on systems. 
Whether applying entropy in the thermodynamic sense or from the information 
perspective, ME can help here as well. The realization that the ME entropy 
$S_{ME} = \log \zeta + \beta F$ is of the exact same form as the thermodynamic entropy$^4$ 
is of no small consequence. All of the concepts that thermodynamics utilizes 
can now also be utilized in models using the ME methodology, whether it be 
energy considerations or equilibrium conditions, etc. In addition, one can get a 
measure of diversity directly from this method [8]. To see a detailed method for 
calculating $\zeta$, see [5]. 
4 The thermodynamical entropy actually has a $-\beta$. Although the ME entropy has a $+\beta$, the 
sign is trivial as it is merely a matter of preference in our method. We could have subtracted 
the Lagrange multipliers instead of adding them in 